[Tune] Enable experiment restore from moved cloud uri #31669
Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Fix constructor args Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Fix all references to _checkpoint_dir Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
# since the dir might not be creatable locally.
# TODO(ekl) this is kind of a hack.
if not ray.util.client.ray.is_connected():
    trial.init_logdir()  # Create logdir if it does not exist
I moved this out of `Trial.__setstate__`. Are trials serialized/deserialized in any situation that would require this to stay within `__setstate__`? My understanding is that trial objects don't get shipped around and stay on the Tune driver. They are only serialized/deserialized on experiment checkpoint and restore, which is handled here.
What's the reason for moving this?
You're correct @justinvyu and I think your solution is much cleaner. For setstate/getstate, it's ok to just restore exactly the state that was saved. If properties are overwritten, that should happen in the function that issues the restore. No need for magic here.
@krfricke doesn't this make trials effectively unserializable in the general case unless the trainable is registered? I don't think that's an issue, but perhaps something to consider
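A minimal sketch of the pattern discussed in this thread, not the actual Ray Tune code: `__getstate__`/`__setstate__` round-trip the saved state unchanged, and the function that issues the restore applies any overrides. `TrialSketch`, `save_trial`, `restore_trials`, and `new_local_dir` are hypothetical names used only for illustration, and JSON stands in for whatever serialization format is really used.

```python
import json


class TrialSketch:
    """Hypothetical stand-in for a trial object; not Ray's Trial class."""

    def __init__(self, trial_id, local_dir):
        self.trial_id = trial_id
        self.local_dir = local_dir

    def __getstate__(self):
        # Save the state exactly as it is.
        return dict(self.__dict__)

    def __setstate__(self, state):
        # Restore exactly what was saved: no side effects, no overrides.
        self.__dict__.update(state)


def save_trial(trial):
    """Serialize a trial's state to a JSON string."""
    return json.dumps(trial.__getstate__())


def restore_trials(blobs, new_local_dir=None):
    """Deserialize trials, then apply overrides in the restoring function."""
    trials = []
    for blob in blobs:
        trial = TrialSketch.__new__(TrialSketch)
        trial.__setstate__(json.loads(blob))
        if new_local_dir is not None:
            # The override lives here, not inside __setstate__.
            trial.local_dir = new_local_dir
        trials.append(trial)
    return trials


saved = [save_trial(TrialSketch("t1", "/old/experiment"))]
restored = restore_trials(saved, new_local_dir="/new/experiment")
print(restored[0].local_dir)
```

Keeping the override in `restore_trials` means a plain deserialization (no `new_local_dir`) returns the trial exactly as it was saved.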
This is great, thanks! Only a couple of nits
…y-project#32010) ray-project#31669 changed the `Trial.__dict__` by moving `local_dir` to `_local_dir`, which resulted in an error in our tune cloud tests. This PR updates the signature of the `TrialStub` class to resolve the issue. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…th (#40647) This is a known regression introduced in 2.7: moving the path of the experiment directory and attempting to restore the experiment and/or the experiment results doesn't work due to the absolute paths saved in the trial metadata. This PR implements a fix similar to #31669 -- replacing the root of the tracked checkpoint paths with the new storage path, and updating on experiment restoration / result loading from a path. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
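The root replacement described in that commit message can be sketched roughly as follows. This is an illustrative helper under stated assumptions, not the code from either PR: `replace_root` and the example paths are hypothetical, and paths are treated as plain `/`-separated strings so that the same logic covers both local directories and cloud URIs.

```python
def replace_root(path, old_root, new_root):
    """Re-root `path` from `old_root` onto `new_root`.

    Paths that do not live under `old_root` are returned unchanged.
    """
    old_root = old_root.rstrip("/")
    new_root = new_root.rstrip("/")
    if path == old_root:
        return new_root
    # Require a separator after the root so "/old/exp2" is not
    # mistaken for a child of "/old/exp".
    if path.startswith(old_root + "/"):
        return new_root + path[len(old_root):]
    return path


# Hypothetical tracked checkpoint paths saved with the old experiment root.
tracked = [
    "/old/exp/trial_0/checkpoint_000001",
    "/old/exp/trial_1/checkpoint_000003",
]
moved = [replace_root(p, "/old/exp", "/new/location/exp") for p in tracked]
print(moved)
```

Applying this once, at restore or result-loading time, leaves the saved metadata untouched while every path used afterwards points at the new location.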
Why are these changes needed?
This PR fixes experiment restoration from a different cloud URI so that future results and checkpoints are saved to the new URI instead of continuing to be written to the old location. The workflow of starting an experiment locally, uploading the experiment directory to the cloud, and then restoring from that URI on a different cluster is now also possible.
Example:
This is a follow-up to #29920, which enabled experiment restoration from a moved local experiment directory.
I also took the chance to simplify some of the trial loading/serialization logic.
Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
I've run `scripts/format.sh` to lint the changes in this PR.